Biologists possess the detailed knowledge critical for extracting biological insight from genome-wide data resources, and
yet they are increasingly faced with nontrivial computational analysis challenges posed by genome-scale methodologies. 
To lower this computational barrier, particularly in the early data exploration phases, we have developed an interactive 
pattern discovery and visualization approach, Spark, designed with epigenomic data in mind. Here we demonstrate 
Spark's ability to reveal both known and novel epigenetic signatures, including a previously unappreciated binding 
association between the YY1 transcription factor and the corepressor CTBP2 in human embryonic stem cells. 



[Supplemental material is available for this article.] 

A pressing challenge arising from the productivity of large-scale 
data-generating consortia, such as the Encyclopedia of DNA Ele- 
ments (ENCODE) Project (The ENCODE Project Consortium 2012) 
or the Roadmap Epigenomics Project (Bernstein et al. 2010), is 
ensuring that these data are accessible to the biological community 
for analysis. While public repositories provide easy access to pri- 
mary data, subsequent data processing and analysis can pose 
a significant computational hurdle to many biologists. In addition, 
the depth and breadth of these resources are unprecedented, and 
much of the initial analysis may be exploratory in nature. The 
biologically interesting signals may be too poorly understood at 
the outset to be identified and analyzed in an automated fashion. 
Visualization is a powerful approach in such cases. Not only does it 
lower the computational barrier for use, but it is also particularly
effective in facilitating human reasoning about complex data, 
which is essential during this early exploration phase. 

Genome browsers are one such class of visualization tool that 
have enjoyed widespread popularity among biologists and that 
frequently serve as the primary means of examining genome-wide 
data during the initial inspection and discovery phases. Part of 
their power comes from the ability to integrate diverse data sets by 
plotting them as vertically stacked 'tracks' across a common genomic
x-axis. Genome browsers have played an important role in
increasing the accessibility of large public data sets; for example,
the ENCODE data resource is currently hosted by the UCSC Genome
Browser (Kent et al. 2002).

However, the power of genome-wide data sets is in their ability
to reveal global regulatory patterns that would be difficult, if not 
impossible, to extrapolate from studies of individual loci. Genome 
browsers inherently limit the data view to individual loci, and while 
invaluable for visualizing data patterns at specific regions of interest, 
they have limited power to facilitate global analysis. For many types 
of queries, there is a mismatch between the level of data abstraction 
at which the investigator wishes to interrogate the data set (e.g., 
gene set) and the level at which the data are displayed in a genome 
browser (e.g., individual gene). As a result, computational experts 
typically conduct such global analyses with custom tools. Recently, 
the Human Epigenome Browser (Zhou et al. 2011) enabled users to 
filter the genomic x-axis to only annotated genes involved in a
pathway of interest, as queried by a KEGG identifier. This is an im- 
portant step toward replacing the genome coordinate axis with 
a functional axis and enabling comparisons of data tracks across 
multiple loci within the genome browser framework, but depending 
on the size of the gene set, it can still be challenging to obtain an 
overview of the data patterns from such a view. 

There are several good examples of computational methods 
that generate biologically meaningful genome-wide data summa- 
ries. One common approach used to interpret epigenomic data, 
such as histone modifications and DNA methylation, is to identify 
and functionally characterize combinatorial data patterns. For 
example, methylation of both lysine 4 and lysine 27 on histone H3 
is an epigenetic signature characteristic of embryonic stem cells, 
termed a 'bivalent domain,' thought to silence developmental
genes while keeping them poised for activation (Azuara et al. 2006; 
Bernstein et al. 2006). Early work in signature detection clustered 
well-annotated promoters on the basis of specific histone modifi- 
cation patterns derived from chromatin immunoprecipitation 
(ChIP) coupled microarray data (ChIP-chip) (Heintzman et al.
2007). Both seqMINER (Ye et al. 2011) and Cistrome (Liu et al. 
2011) are analysis tools that include such a clustering approach
and provide cluster visualization through static heatmaps. A prob- 
abilistic method, ChromaSig, subsequently eliminated the de- 
pendence on existing annotations and offered a way to discover 
chromatin signatures de novo by searching genome-wide using 
data from ChIP followed by sequencing (ChIP-seq) (Hon et al.
2008). More recently, hidden Markov model (HMM) and Bayesian
network approaches have been applied to uncover recurrent 
chromatin states (Ernst and Kellis 2010; Hoffman et al. 2012). 
However, none of these approaches support interactive data 
exploration. 

All of the above tools produce static summary images, typically
in the form of heatmaps, and offer few or no mechanisms
by which to dynamically guide the analysis based on human
knowledge of the biological system under study. Here we
present Spark, a visualization approach that employs clustering to 
create a global data overview and high-level entry point for anal- 
ysis, while also enabling interactive drill-down to the supporting 
data at the level of individual loci. It is intended to facilitate re- 
sponsive exploratory navigation through a genome-wide data set 
and to be used as a complement to genome browsing. Its novelty 
over existing tools lies in its support of user-guided clustering, 
specifically enabling users to split existing clusters into subclusters 
and thus direct the clustering algorithm toward patterns of in- 
terest. Given that the clusters are generated across a set of user- 
specified input regions, Spark supports the analysis of both well- 
annotated regions and potential novel elements, such as those 
identified as having enrichments in a particular ChIP-seq experiment.
The tool is connected to popular external resources: The display
links individual loci to the corresponding view
in the UCSC Genome Browser, and gene ontology (GO) analysis is
available at the cluster level by interfacing with the DAVID suite of
tools (Huang et al. 2009). These connections minimize the need for
programmatic data manipulation. Spark employs a very general clustering
technique with few parameters and can therefore flexibly
handle diverse data sets. The ENCODE and Human Epigenome 
Atlas data sets are directly accessible through the Spark user in- 
terface, and initial results suggest that Spark will be a valuable ex- 
ploratory tool for these communities. 

Results 

Availability and installation 

Spark is a Java application for all platforms and is currently avail- 
able from http://www.sparkinsight.org. A sample clustering anal- 
ysis is packaged with Spark and can be loaded from the initial 
launch screen or from the Help menu. We provide a built-in user 
guide and tutorial video, also linked from the initial launch screen 
and Help menu. All of these supporting resources are additionally 
available from the above Spark website. 

The preprocessing and clustering steps of Spark are avail- 
able as command-line utilities to facilitate batch processing 
if desired. For convenience, we have run the Spark preprocessing
step on all 1800 Epigenome Atlas files (Release 7;
http://www.epigenomeatlas.org) using the set of reference regions
available in Spark and default parameters. This enables Spark 
to load these resources in a much shorter time. 

In addition to being deployed as a standalone package, Spark 
is also available as a service within the Epigenome toolset of the 
Genboree Workbench (http://www.genboree.org) (Challis et al.
2012). The Genboree deployment enables analysis of any private
or public data hosted at Genboree. It also supports simultaneous
processing of several Spark clustering analyses, which is not pos- 
sible with the standalone tool. A tutorial video demonstrating 
these features is available from the Spark website. 

Questions and comments about Spark can be directed to the 
Spark Google Group: http://groups.google.com/group/spark_users/. 

Overview 

A Spark analysis begins with two user inputs: (1) one or more data 
files and (2) a set of regions. Wiggle/bigWig and GFF3 formats are
accepted for these two inputs, respectively. Within Spark's graph- 
ical user interface (GUI), a user can either select files from the listed 
ENCODE and Epigenome Atlas data resources or can specify their 
own data files either as URLs or by browsing their local file system. 
The user-specified regions can be any set of genomic coordinates, 
for example, the regions flanking known transcriptional start site 
(TSS) annotations or defined by a set of ChIP-seq enrichment
peaks. Several human reference region sets are also available 
through the GUI. Spark extracts data matrices from the specified 
regions, which are then binned and normalized (Fig. 1, step 1). 
These values form the basis of the clustering and are written to
a binary file for faster future reloading.

Figure 1. The Spark workflow. In step 1, the user's input data and regions
of interest are preprocessed to enable rapid data retrieval in later
steps. (Gray) Data enrichment peaks for two data samples; (vertical black
boxes) user's regions of interest (r1-r5) centered on transcriptional start
sites (TSSs). A data matrix is extracted for each input region and oriented
according to strand. Rows in these matrices correspond to data samples,
while the columns represent data bins along the genomic x-axis; two bins
per region are used in this diagram. The values are then normalized to be
between 0 and 1, represented here by white and dark blue, respectively. In
step 2, the matrices are clustered, k = 2 in this diagram, resulting in two
clusters (c1 and c2). In step 3, the clusters and their region members are
viewed in the Spark interactive visualization interface.
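
The preprocessing in step 1 can be illustrated with a minimal Python sketch. It is not Spark's implementation: the function names are hypothetical, the input is assumed to be a per-base signal already extracted for one region, and simple min-max scaling is swapped in as a stand-in for the normalization of Hon et al. (2008), which may differ in detail.

```python
from typing import List


def bin_signal(values: List[float], n_bins: int) -> List[float]:
    """Average a per-base signal into n_bins contiguous bins."""
    if n_bins <= 0 or len(values) < n_bins:
        raise ValueError("need at least one value per bin")
    binned = []
    for i in range(n_bins):
        # Integer arithmetic guarantees every bin holds at least one value.
        start = i * len(values) // n_bins
        end = (i + 1) * len(values) // n_bins
        chunk = values[start:end]
        binned.append(sum(chunk) / len(chunk))
    return binned


def orient(binned: List[float], strand: str) -> List[float]:
    """Reverse minus-strand regions so that all rows read in the same direction."""
    return binned[::-1] if strand == "-" else list(binned)


def min_max_scale(rows: List[List[float]]) -> List[List[float]]:
    """Scale one data sample (one row per region) into [0, 1].

    Assumption: plain min-max scaling across all regions of the sample.
    """
    flat = [v for row in rows for v in row]
    lo, hi = min(flat), max(flat)
    if hi == lo:
        return [[0.0 for _ in row] for row in rows]
    return [[(v - lo) / (hi - lo) for v in row] for row in rows]
```

Here a 6-kb region with 300-bp bins would correspond to n_bins = 20 per data sample; stacking the scaled rows of all samples for one region yields that region's data matrix.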

The preprocessed data are then clustered by k-means
clustering (Fig. 1, step 2) with a user-specified number of clusters
(k). This technique was chosen for its effectiveness, relative
simplicity, and runtime speed. Clusters are also written to text files
for reuse.
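
For readers unfamiliar with the technique, the following is a minimal sketch of k-means (Lloyd's algorithm) with Euclidean distance, assuming each region's data matrix has been flattened into one vector. The function names, the random seeding of centroids, and the iteration cap are illustrative assumptions and do not describe Spark's internals.

```python
import random
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def sq_dist(a: Vector, b: Vector) -> float:
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points: List[Vector], k: int, n_iter: int = 100,
           seed: int = 0) -> Tuple[List[int], List[List[float]]]:
    """Return (labels, centroids) after at most n_iter Lloyd iterations."""
    if not 1 <= k <= len(points):
        raise ValueError("k must be between 1 and the number of points")
    rng = random.Random(seed)
    # Assumption: initial centroids are k distinct input points chosen at random.
    centroids = [list(p) for p in rng.sample(points, k)]
    labels = [-1] * len(points)
    for _ in range(n_iter):
        # Assignment step: each point joins its nearest centroid.
        new_labels = [min(range(k), key=lambda c: sq_dist(p, centroids[c]))
                      for p in points]
        if new_labels == labels:
            break  # converged to a local optimum
        labels = new_labels
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids
```

Because the result depends on the initial centroids, different seeds can yield different partitions, which is one reason user-guided refinement of the output is useful.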

Finally, the analysis output is displayed in the Spark GUI (Fig. 
1, step 3). The interactive visualization encompasses two core 
components: a cluster overview panel, which provides a summary 
of each cluster, and the region browser. For a video demonstration 
of the interface, see the Spark website. In the cluster overview 
panel, clusters are initially sorted from left to right by decreasing 
number of regions. Each cluster is represented by a heatmap 
computed by averaging the data matrices from the cluster's 
member regions. A histogram immediately above the cluster panel 
indicates the number of regions per cluster. When a user selects 
a given cluster, data matrices from the cluster's regions are dis- 
played as heatmaps in the region browser, where they are sorted by 
chromosome position. The genome coordinates are displayed be- 
low each individual region in the browser, and a context menu 
provides a hyperlink to that region in the UCSC Genome Browser. 
The interface is also equipped with search functionality, enabling 
a user to easily locate a region of interest within the clustering. 
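
The cluster overview described above (one averaged heatmap per cluster, ordered by decreasing number of regions) can be sketched as follows. This is an illustration only; the function name and the nested-list matrix representation are assumptions.

```python
from collections import defaultdict
from typing import Dict, Hashable, List, Tuple

Matrix = List[List[float]]  # rows = data samples, columns = bins


def cluster_overview(matrices: List[Matrix],
                     labels: List[Hashable]) -> List[Tuple[Hashable, int, Matrix]]:
    """Return (label, region count, mean matrix) per cluster, largest first."""
    groups: Dict[Hashable, List[Matrix]] = defaultdict(list)
    for matrix, label in zip(matrices, labels):
        groups[label].append(matrix)

    overview = []
    for label, members in groups.items():
        n = len(members)
        n_rows, n_cols = len(members[0]), len(members[0][0])
        # Element-wise average of the member regions' data matrices.
        mean = [[sum(m[r][c] for m in members) / n for c in range(n_cols)]
                for r in range(n_rows)]
        overview.append((label, n, mean))

    # Clusters are shown left to right by decreasing number of regions.
    overview.sort(key=lambda item: item[1], reverse=True)
    return overview
```

The region counts in the second tuple position are the values that would be drawn in the histogram above the cluster panel.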

Interactive cluster refinement 

The general problem of finding a globally optimal partitioning of 
d-dimensional data into k sets is known to be NP-hard. Heuristic
algorithms, such as k-means clustering, are therefore used to efficiently
find a local optimum and come with the risk of reporting
poor solutions. Even if a globally optimal solution were attainable,
clustering involves minimizing some mathematical criterion, and 
it is very possible that such a criterion will not sufficiently capture 
the features a biologist would use to categorize their data. 

The philosophy behind Spark is to employ a simple and 
computationally efficient clustering algorithm (k-means) and to
augment it by allowing the user to interactively guide the output
according to their expert biological knowledge. This is done
by enabling interactive cluster splitting, whereby a user can run
k-means clustering with k = 2 on only the subset of regions contained
within the selected cluster. An additional discussion of the
initial choice of k is provided in the Supplemental Material. This 
approach synergizes automated clustering with user feedback to 
produce a more powerful exploration tool. 
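
A cluster split of this kind can be sketched in Python as below. The sketch assumes flattened region vectors and string cluster labels such as "c1"; the deterministic seeding (first point and the point farthest from it) and the "-1"/"-2" suffix naming are illustrative choices, not a description of Spark's code.

```python
from typing import List, Sequence

Vector = Sequence[float]


def _sq_dist(a: Vector, b: Vector) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b))


def two_means(points: List[Vector], n_iter: int = 100) -> List[int]:
    """k-means with k = 2; returns a 0/1 label per point."""
    # Assumption: seed with the first point and the point farthest from it.
    far = max(points, key=lambda p: _sq_dist(p, points[0]))
    centroids = [list(points[0]), list(far)]
    labels = [-1] * len(points)
    for _ in range(n_iter):
        new = [0 if _sq_dist(p, centroids[0]) <= _sq_dist(p, centroids[1]) else 1
               for p in points]
        if new == labels:
            break
        labels = new
        for c in (0, 1):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


def split_cluster(points: List[Vector], labels: List[str], target: str) -> List[str]:
    """Re-cluster only the regions in `target`, leaving all other labels untouched."""
    idx = [i for i, lab in enumerate(labels) if lab == target]
    if len(idx) < 2:
        raise ValueError("need at least two regions to split a cluster")
    sub = two_means([points[i] for i in idx])
    out = list(labels)
    for i, s in zip(idx, sub):
        out[i] = f"{target}-{s + 1}"  # e.g. "c1" becomes "c1-1" or "c1-2"
    return out
```

Repeated calls build up hierarchical labels (for example, splitting "c1-2" gives "c1-2-1" and "c1-2-2"), so the sequence of user decisions remains visible in the cluster names.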

Interactive GO analysis 

The functional classification of regions bearing interesting data 
signatures is a natural and common next analysis step. Spark 
supports the interactive analysis of gene ontology (GO) term en- 
richments for each cluster within the GUI. This is achieved 
through interfacing with the DAVID suite of web-based tools 
(Huang et al. 2009). 

Applications 

Epigenetic patterns flanking TSSs 

To validate our approach, we applied Spark to sequencing-based 
histone modification, DNA methylation, and expression data in 
H1 human embryonic stem cells (hESCs) (Harris et al. 2010) across
transcriptional start sites (TSSs) where epigenetic signatures have
been previously characterized (Lister et al. 2009; Hawkins et al.
2010). Trimethylation of histone H3 at Lys4 (H3K4me3) and at Lys27
(H3K27me3) has positive and negative regulatory effects on
transcription, respectively (for review, see Schuettengruber et al.
2007). These two modifications colocalize to form 'bivalent' domains
at the promoters of developmentally important genes in
embryonic stem cells, serving to silence these genes while keeping 
them poised for lineage-specific activation (Azuara et al. 2006; 
Bernstein et al. 2006). These modifications therefore discriminate 
three main classes of promoters in embryonic stem cells: active, 
repressed, and poised (Mikkelsen et al. 2007). Spark successfully 
recapitulates these classes of TSSs in hESCs (Fig. 2A): From left to 
right, the first cluster is clearly marked with H3K4me3 and pos- 
sesses an RNA-seq signal indicative of transcriptional activity, the 
second cluster bears the bivalent signature of both H3K4me3 and 
H3K27me3, and the third cluster appears transcriptionally in- 
active. Only the transcriptionally active and poised clusters (Fig. 
2A) have notable CpG densities, consistent with previous obser- 
vations that H3K4me3 predominantly localizes to CpG-rich pro- 
moters, suggesting important regulatory differences between pro- 
moters at the two extremes of CpG density (Mikkelsen et al. 2007). 
Using Spark's option to launch DAVID's Functional Annotation 
Tool (Huang et al. 2009), we find that the poised cluster shows 
significant enrichment in the terms 'homeobox' (P < 1 × 10⁻⁵⁹),
'regulation of transcription' (P < 1 × 10⁻¹⁷), and 'embryonic
morphogenesis' (P < 1 × 10⁻³¹), consistent with earlier characterizations
of bivalent domains overlaying developmentally important
transcription factors (Bernstein et al. 2006).

These data can be further explored using Spark's interactive 
cluster splitting mechanism. For example, we can interactively 
split the poised cluster to produce two groups, one bearing a much 
broader H3K27me3 signal than the other (Fig. 2B, c1-2-1 and
c1-2-2). This refined clustering is consistent with a report suggesting
that a minority of bivalent sites contain 'wide' H3K27me3
signals extending over regions of at least 5 kb, while the majority 
shows punctate H3K27me3 signatures (Mikkelsen et al. 2007). 
Bivalent regions have been reported to be hypomethylated
(Brunner et al. 2009; Meissner et al. 2008). In this study, we
employed a methylation-sensitive restriction enzyme assay (MRE)
to detect unmethylated CpGs and a methylation-dependent immunoprecipitation
procedure (MeDIP) to enrich for methylated CpGs. Intriguingly,
Spark highlights how closely the absence of DNA methylation, 
indicated by the strong MRE sequencing (MRE-seq) and weak 
MeDIP sequencing (MeDIP-seq) signals, tracks with H3K27me3 
localization at bivalent sites. 

In a similar fashion, cluster splitting can be used to explore 
the transcriptionally inactive class of TSSs (Fig. 2A, c2). This group 
appears to be heterogeneous, with a subcluster displaying a strong 
H3K9me3 signal (Fig. 2B, c2-1-2). This H3K9me3-containing group
includes several gene clusters, for example, the olfactory receptors 
(ORs) and the late cornified envelope (LCE) gene family, as 
reported recently (Hawkins et al. 2010). The users' ability to direct 
the subclustering in this way allows them to take advantage of 
their biological knowledge to isolate interesting subsets that may 
not have been immediately produced by an automated clustering 
using default parameters and the same end k value. 

Figure 2. Clustering analysis at annotated TSSs. (A) Histogram indicates the number of regions in each cluster, and the overlaid dendrogram traces the
interactive cluster splitting events (initial clustering with k = 2, followed by one manual split of cluster c1 into c1-1 and c1-2). Chromatin modification
(blue), DNA methylation (green; MeDIP and MRE indicate methylated and unmethylated CpGs, respectively), and RNA-seq (orange) data from H1 hESCs
together with genomic CpG density values (gray) were clustered using a bin size of 300 bp across 6-kb windows centered on RefSeq transcriptional start
sites (TSSs). (B) Further exploration and interactive refinement of the clusters from A.

Epigenetic patterns around YY1 binding sites

After validating Spark using previously published histone modification
and DNA methylation data from hESCs, we sought to apply it
to explore the genome-wide profiles of three transcription regulatory
factors and their relationships with particular epigenetic signatures.
This analysis was motivated by the hierarchical recruitment
model in Drosophila, which suggests that the sequence-specific 
transcription factor, pleiohomeotic (PHO), recruits the polycomb 
repressive complex 2 (PRC2), which in turn trimethylates H3K27 
and leads to the binding of the polycomb repressive complex 1 
(PRC1) (Wang et al. 2004). Polycomb group (PcG) proteins, which 
include PHO and members of PRC1 and PRC2, typically function 
in maintaining transcriptional repression and play essential roles 
in normal development in most multicellular organisms (Morey 
and Helin 2010). While the human ortholog of PHO, the YY1 transcription
factor (YY1; also known as Yin Yang 1), has identical
DNA binding specificities to PHO in vitro (Brown et al. 1998) and
can functionally compensate for loss of PHO in pho mutant flies 
(Atchison et al. 2003), it remains unclear whether YY1 plays a role in 
triggering a regulatory cascade that results in H3K27 trimethylation 
and subsequent transcriptional silencing in mammalian cells. 

To investigate this model, we profiled three factors in hESCs 
using ChIP-seq: (1) YY1; (2) a component of PRC2, suppressor of
zeste 12 (SUZ12); and (3) the corepressor C-terminal binding pro- 
tein 2 (CTBP2), which is thought to play a role in YY1 binding and 
PcG recruitment in fly (Srinivasan and Atchison 2004). Using 
Spark, these ChIP-seq profiles were explored and integrated with the
previously described DNA methylation and histone modification
data from hESCs. To avoid limiting our analysis to annotated
promoter or enhancer regions, we adopted a data-driven approach 
and took advantage of Spark's flexibility to use the data peaks 
themselves to define the input region set for clustering. For this, 
region boundaries were defined as ±3 kb from each YY1 peak 
center using the top 5% of peaks sorted by peak height. 
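
The construction of this input region set can be sketched as follows, assuming each peak is given as a (chromosome, center, height) tuple. The function name, the tuple layout, and the clamping of window starts at zero are assumptions made for illustration.

```python
from typing import List, Tuple

Peak = Tuple[str, int, float]    # (chromosome, peak center, peak height)
Region = Tuple[str, int, int]    # (chromosome, start, end)


def top_peak_windows(peaks: List[Peak], top_fraction: float = 0.05,
                     flank: int = 3000) -> List[Region]:
    """Return windows of +/- flank bp around the tallest top_fraction of peaks."""
    if not peaks:
        return []
    # Keep at least one peak even for very small inputs.
    n_keep = max(1, int(len(peaks) * top_fraction))
    ranked = sorted(peaks, key=lambda p: p[2], reverse=True)[:n_keep]
    # Clamp at 0 so that windows near a chromosome start stay valid.
    return [(chrom, max(0, center - flank), center + flank)
            for chrom, center, _height in ranked]
```

The resulting 6-kb windows would then be written in GFF3 format to serve as the region input.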

The hierarchical recruitment model introduced above predicts
colocalization of YY1 with H3K27me3. Strikingly, no YY1-centered
cluster shows a strong H3K27me3 signal (Fig. 3). In fact,
only 1% of the YY1 peaks share an overlap with H3K27me3. This
trend is robust even when the peak threshold is relaxed (<5% of
YY1 peaks overlap an H3K27me3 peak in the full set). Our results
indicate that, at most, only a minority of YY1 binding events are 
involved in H3K27me3 deposition. 
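
An overlap fraction of this kind can be computed with a short sketch like the one below. It assumes half-open (chromosome, start, end) intervals and uses a simple quadratic scan, which is adequate for illustration but not tuned for genome-scale peak sets; the function name is hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Interval = Tuple[str, int, int]  # (chromosome, start, end), half-open


def fraction_overlapping(query: List[Interval],
                         reference: List[Interval]) -> float:
    """Fraction of query intervals that overlap at least one reference interval."""
    if not query:
        return 0.0
    by_chrom: Dict[str, List[Tuple[int, int]]] = defaultdict(list)
    for chrom, start, end in reference:
        by_chrom[chrom].append((start, end))

    hits = 0
    for chrom, start, end in query:
        # Two half-open intervals overlap when each starts before the other ends.
        if any(start < r_end and r_start < end
               for r_start, r_end in by_chrom.get(chrom, [])):
            hits += 1
    return hits / len(query)
```

With YY1 peaks as the query and H3K27me3 peaks as the reference, the returned value corresponds to the percentages quoted above.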

Further exploration of the YY1-centered clusters in Spark
suggests that YY1 forms mutually exclusive complexes in hESCs 
with two important coregulators, SUZ12 and CTBP2 (Fig. 3B). 
SUZ12 and YY1 were found to colocalize within intergenic regions 
(cluster c1-1) and gene bodies (clusters c1-1 and c1-2-1) and at
centromeres and telomeres (cluster c1-2-2). The absence of
H3K27me3 in these clusters was initially surprising given that 
SUZ12 is a component of PRC2, which has known histone methyl- 
transferase activity. However, this activity is mediated by EZH2, 
which may be absent at these sites. Alternatively, PRC2 can also 
methylate H1K26 (Xu et al. 2010), and it is possible that these sites 
display this mark. A third possibility is that YY1 and SUZ12 colocalize with
a histone demethylase that has converted the H3K27me3 to
H3K27me2 or H3K27me1. In subsequent motif analysis performed
outside of Spark (see Methods), none of these clusters show enrich- 
ment for the canonical YY1 motif, suggesting that further in- 
vestigation is needed to determine whether these patterns arise from 
direct YY1 binding or as a result of an alternate YY1 recruitment 
mechanism. 

In contrast, YY1 motif enrichment (P < 0.0001) is observed at 
sites of colocalization with CTBP2 (cluster c2). These regions dis- 
play strong H3K4me3, H3K9Ac, and RNA expression signals in 
Spark characteristic of transcriptionally active promoters, and 
subsequent comparison to known annotations outside of Spark 
reveals that the majority (88%) of these YY1 peak centers are 
within 2 kb of an annotated TSS. Individual regions can be viewed 
in the region browser (Fig. 3C) or via links to the UCSC Genome 
Browser (Fig. 3D). CTBP2, absence of which is embryonic lethal in 
mice (Hildebrand and Soriano 2002), is typically considered to 
function as a corepressor in mammalian cells (Chinnadurai 2003). 
There exists some evidence that the Drosophila CtBP homolog 
possesses a context-dependent transcriptional activation function 
(for review, see Chinnadurai 2003); however, the observed colo- 
calization with YY1 at transcriptionally active TSSs has not been 
previously reported. GO analysis points to these genes being 
enriched in roles of RNA binding and processing, suggesting po- 
tential novel regulatory roles for CTBP2 and YY1 in hESCs. 

Discussion 

Spark is motivated by the need for data exploration tools that fa- 
cilitate initial investigation of genome-wide data sets by the bi- 
ology community. We recognize that the current paradigm of 
delegating analysis to a comparatively small community of com- 
putational experts will not effectively scale to the analysis de- 
mands of the current and ever-growing data resources. It is essen- 
tial that the broader biology community is able to actively conduct 
initial inquiries and thus formulate the more detailed and biologically
motivated hypotheses that warrant in-depth investigation.
Visualization techniques are ideal for such applications in
that they effectively lower the computational barrier for use while 
providing a powerful mechanism to facilitate human reasoning 
about complex data. We propose a visualization method that 
blends automated clustering with user interaction to provide 
a navigational tool that offers both meaningful data overviews 
and access to the relevant data details on demand. 

The approach embodied in Spark has several strengths: (1) it 
employs a very general clustering technique with few parameters, 
which can flexibly handle diverse data sets; (2) it is not dependent 
on existing annotations, but rather clusters data across a user- 
specified set of input regions that can be known or novel elements; 
(3) it provides an interactive visual interface that enables simul- 
taneous viewing of both genome-scale data signatures and patterns 
at individual loci, providing information about content and vari- 
ation; and (4) it offers users interactive cluster refinement capa- 
bilities, enabling them to dynamically guide the clustering. 

To facilitate using Spark with existing public resources, we 
have integrated the data inventory of the ENCODE Project and the 
Roadmap Epigenomics Project directly into the Spark GUI. We also 
support import of a user's own data in standard formats (wig/bigWig).
Following the design philosophy of leveraging existing and
widely used tools, we link each locus in the Spark display to the
corresponding view in the UCSC Genome Browser and also interface
with the DAVID GO analysis tools to enable downstream
functional analysis without the need for programmatic manipu- 
lation. In addition to being available as a standalone software 
package, Spark is also deployed as a service within the Epigenome 
toolset of the Genboree Workbench. 

One natural direction for future work would be to incorporate 
additional clustering techniques into Spark. In particular, methods 
that first identify the subset of data tracks that are most in- 
formative for clustering may be valuable as the number of input 
data tracks grows. However, one insight that emerged while using 
Spark for analysis is that the criteria for defining similarity between 
data patterns can vary greatly depending on the application. 
A researcher may be most interested in regions that show dis- 
tinct positional distributions of data across the query regions, or 
they may be primarily interested in regions with different signal 
amplitudes. There is unlikely to be an optimal distance metric or 
clustering algorithm for all features of biological interest. Rather, 
what seems most promising is to provide easy-to-understand 
clustering methods and then exploit the biologist's knowledge 
and judgment to guide the clustering to construct subsets of 
interest for further inquiry. The interactive cluster manipulation 
functionality currently in Spark is only a first step in this di- 
rection and warrants further investigation. 

Through our application examples using data from the 
ENCODE and Human Epigenome Atlas projects, we have dem- 
onstrated Spark's ability to discover novel data patterns from a di- 
verse collection of genome-wide data types. These signatures were 
not readily apparent through a genome browser view and would 
otherwise have required custom computational manipulation to 
obtain. We anticipate that Spark will be of widespread use in 
exploring these large public data sets and will increase the ac- 
cessibility of these resources to the broader biology community. 
It is also our hope that the navigational paradigm captured in 
Spark will inspire other visualization methods that complement 
traditional genome browsers by offering interactive, high-level, 
functional summaries of genomic data as an entry point for 
exploratory analysis. 

Figure 3. Clustering analysis of YY1 binding sites. (A) Histogram indicates the number of regions in each cluster, and the overlaid dendrogram traces
the interactive cluster splitting events. (B) ChIP-seq data for YY1, CTBP2, SUZ12, and histone modifications (blue) together with MRE-seq and MeDIP-seq
(green) and RNA-seq (orange) data from H1 hESCs were clustered using a bin size of 300 bp across 6-kb windows centered on sites of YY1 ChIP-seq
enrichment. (C) Scrollable region browser: Data from individual regions within the currently selected cluster (c2) can be interactively viewed (five regions
displayed at one time, r1-r5). (D) A context menu provides a hyperlink to the corresponding region display within the UCSC Genome Browser (view of r1
shown).

Methods

ChIP-seq

Human embryonic stem cells (hESCs) were obtained from Cellular 
Dynamics as part of a large batch of cells prepared for the ENCODE 
Consortium and the RoadMap Epigenome Consortium. Cell growth
and crosslinking conditions can be found at
http://www.genome.ucsc.edu/ENCODE/cellTypes.html. ChIP-seq experiments for the
histone modifications have been described previously (Harris 
et al. 2010). The YY1 and SUZ12 ChIP assays were performed 
using 5 × 10⁷ cells per assay, and 28 µg of chromatin was used for the
CTBP2 ChIP assay. ChIP assays were performed following the
protocol provided at
http://farnham.genomecenter.ucdavis.edu/pdf/FarnhamLabChIP%20Protocol.pdf,
except that StaphA cells
were blocked only with BSA before use and the preclearing 
step was omitted. The antibodies used were as follows: SUZ12 
(Kirmizis et al. 2004), YY1 (Santa Cruz Biotechnology, SC-1703X), 
and CTBP2 (BD Biosciences 612044). All ChIP and input samples 
(10% of the amount of chromatin used per ChIP) were purified 
using the QIAquick PCR purification kit (QIAGEN) according to 
manufacturer's instructions, and purified eluates were dissolved 
in 50 µL of water. ChIP libraries were created and sequenced
according to the method described previously (Harris et al. 2010) 
with the YY1, SUZ12, and CTBP2 libraries sequenced by the DNA 
Technologies Core Facility at the University of California-Davis 
(http://genomecenter.ucdavis.edu/dna_technologies/). 

DNA methylation assays and RNA-seq 

Methylation-dependent immunoprecipitation and sequencing
(MeDIP-seq), methylation-sensitive restriction enzyme sequencing
(MRE-seq), and RNA-seq were performed as previously described 
(Harris et al. 2010). 

Data processing 

Illumina read sequences (75 bp) were aligned to the reference human 
genome (hg18) using BWA (Li and Durbin 2009). FindPeaks 4.0.15
(Fejes et al. 2008) was subsequently used to detect enrichment peaks 
at an FDR of 0.01. 

Spark 

Input data files were provided in wig format, and input region coordinates
were specified in GFF3 format. For the TSS analysis, k-means
clustering was performed on 6-kb windows centered on RefSeq TSSs. Any TSS having
a neighboring TSS within 3 kb was removed from the set prior to 
clustering. For the YY1 analysis, clustering was performed on 6-kb 
windows centered on high-confidence YY1 peaks (the top 5% 
sorted by maximal peak height). Data values were normalized to be 
between 0.0 and 1.0, according to the method described by Hon 
et al. (2008), and k-means clustering was computed using Euclidean
distance. Spark version 1.1.0 was used for all analyses.
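
The removal of closely spaced TSSs described above can be sketched as follows. The function name and the (chromosome, position) input layout are assumptions, and "within 3 kb" is interpreted here as a distance of at most 3000 bp on the same chromosome.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Tss = Tuple[str, int]  # (chromosome, position)


def drop_crowded_tss(tss: List[Tss], min_gap: int = 3000) -> List[Tss]:
    """Keep only TSSs with no other TSS within min_gap bp on the same chromosome."""
    by_chrom: Dict[str, List[int]] = defaultdict(list)
    for chrom, pos in tss:
        by_chrom[chrom].append(pos)

    kept = []
    for chrom, positions in by_chrom.items():
        positions.sort()
        for i, pos in enumerate(positions):
            # After sorting, only the immediate neighbors need to be checked.
            left_ok = i == 0 or pos - positions[i - 1] > min_gap
            right_ok = i == len(positions) - 1 or positions[i + 1] - pos > min_gap
            if left_ok and right_ok:
                kept.append((chrom, pos))
    return kept
```

Filtering in this way ensures that no 6-kb window centered on a retained TSS contains another annotated TSS.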

Motif analysis 

Motif finding was performed using the W-ChIPMotifs web application
(http://motif.bmi.ohio-state.edu/ChIPMotifs/) (Jin et al. 2009),
and Bonferroni-corrected P-values are reported.

Data access 

Data used in this article have been submitted to the NCBI Gene 
Expression Omnibus (GEO) (http://www.ncbi.nlm.nih.gov/geo/) 
under accession numbers CTBP2, GSM935463; H3K27me3,
GSM428295; H3K36me3, GSM428296; H3K4me1, GSM434762;
H3K4me3, GSM410808; H3K9Ac, GSM410807; H3K9me3, 
GSM428291; MRE-seq, GSM428286; MeDIP-seq, GSM456941; 
RNA-seq, GSM484408; SUZ12, GSM935352; and YY1, GSE39096. 
These data are also available from the Human Epigenome Atlas 
(http://www.epigenomeatlas.org) and the ENCODE data listings 
at the UCSC Genome Browser site
(http://hgdownload.cse.ucsc.edu/goldenPath/hg19/encodeDCC/wgEncodeSydhTfbs/).